[Bugfix][CPU] Fix RotaryEmbedding fallback causing gibberish with --enforce-eager#31643
Conversation
Code Review
This pull request addresses a critical bug that causes models to produce incoherent output when running on the CPU backend with --enforce-eager. The root cause was correctly identified: several CustomOp subclasses lacked a forward_cpu implementation, causing a fallback to C++ kernels that have behavioral inconsistencies with their PyTorch native counterparts. The fix, which involves adding explicit forward_cpu methods to RotaryEmbedding, RMSNorm, GemmaRMSNorm, and RMSNormGated that delegate to forward_native, is sound and directly resolves the issue. The changes are consistent, well-targeted, and ensure that the correct, native PyTorch implementations are used on the CPU, restoring model correctness. The implementation is clean and follows existing patterns in the codebase.
Add explicit forward_cpu methods to CustomOp subclasses that delegate to forward_native, ensuring the CPU backend uses the PyTorch native implementation instead of the buggy CPU C++ kernels when custom_ops='all'.

Classes fixed:
- RotaryEmbedding (rotary_embedding/base.py)
- RMSNorm (layernorm.py)
- GemmaRMSNorm (layernorm.py)
- RMSNormGated (layernorm.py)

Fixes vllm-project#31626

Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com>
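The delegation pattern this commit describes can be sketched as follows. The method names (`forward_native`, `forward_cuda`, `forward_cpu`) follow vLLM's CustomOp convention, but this is a simplified stand-in, not the actual vLLM source:

```python
# Minimal sketch of the delegation pattern in the commit message above.
# Simplified stand-in for a vLLM CustomOp subclass, not the real code.
class CustomOpSketch:
    def forward_native(self, x):
        # Pure-PyTorch reference path (identity stand-in here).
        return x

    def forward_cuda(self, x):
        # Custom-kernel path; on the CPU backend this resolved to the
        # C++ CPU kernels, which is where the divergence came from.
        return x

    def forward_cpu(self, x):
        # The fix: explicitly route CPU execution to the native
        # implementation instead of the kernel fallback.
        return self.forward_native(x)


op = CustomOpSketch()
print(op.forward_cpu(3))  # delegates to forward_native
```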
/cc @ProExpertProg PTAL.
Hi @rickychen-infinirc Thanks for the catch. The root cause is that RMSNorm accepts non-contiguous inputs after #28103, but we didn't add that support to the CPU kernels. It's okay to fall back most custom ops to the torch native implementations, because they are not performance-critical on CPU and can be compiled by torch.compile in most cases. We can dispatch them in:

vllm/vllm/model_executor/custom_op.py Lines 69 to 71 in d5503ca

One special case is RoPE: I prefer to use the CPU custom kernel in eager mode.
- Change CustomOp.forward_cpu() default to forward_native instead of forward_cuda, as most CPU custom kernels are not performance-critical and can have compatibility issues
- Remove the redundant forward_cpu() from RMSNorm, GemmaRMSNorm, and RMSNormGated, since they now inherit the base class behavior

Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com>
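The revised approach can be sketched as below: the base class defaults `forward_cpu()` to `forward_native()`, and only ops that deliberately keep a CPU custom kernel (RoPE, per the reviewer's preference) override it. This is a hedged, simplified model of vLLM's CustomOp hierarchy, not the actual source:

```python
# Simplified model of the revised CustomOp dispatch, not the real vLLM code.
class CustomOp:
    def forward_native(self, *args):
        raise NotImplementedError

    def forward_cpu(self, *args):
        # New default: fall back to the PyTorch-native path on CPU
        # instead of the old forward_cuda -> C++ CPU kernel chain.
        return self.forward_native(*args)


class RMSNormSketch(CustomOp):
    # No forward_cpu override needed any more: the base class
    # default already selects forward_native.
    def forward_native(self, x):
        return ("native", x)


class RotaryEmbeddingSketch(CustomOp):
    def forward_native(self, x):
        return ("native", x)

    def forward_cpu(self, x):
        # Special case: RoPE keeps its CPU custom kernel in eager
        # mode (stand-in return value).
        return ("cpu_kernel", x)
```

With this default, subclasses only opt in to a CPU kernel explicitly, rather than silently inheriting a CUDA-oriented fallback.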
@bigPYJ1151 Thanks for the review and the clarification on the root cause! I've updated the PR based on your suggestion.

Tested with Qwen3-0.6B on CPU with `--enforce-eager`.
Hi @rickychen-infinirc, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com>
…nforce-eager (vllm-project#31643) Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com>
…nforce-eager (vllm-project#31643) Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Summary
Fix gibberish output on CPU backend when `--enforce-eager` is enabled.

Resolves #31626
When running vLLM on the CPU backend with `--enforce-eager`, models may produce incoherent or repetitive outputs.

This happens because `enforce_eager=True` sets `custom_ops="all"`, enabling CustomOp dispatch on CPU. In this mode, `CustomOp.dispatch_forward()` selects `forward_cpu()` implementations when available.

Several CustomOp subclasses did not define `forward_cpu()`, causing them to fall back to the base class behavior, which delegates to `forward_cuda()`. On CPU, this path invokes the C++ CPU kernels, whose behavior diverges from the PyTorch native implementations in certain cases, leading to incorrect computations and degraded output quality.

Root cause
Missing `forward_cpu()` implementations in:
- RotaryEmbedding
- RMSNorm
- GemmaRMSNorm
- RMSNormGated

Fix
Add explicit `forward_cpu()` methods that delegate to `forward_native()`.

This ensures that, when running on CPU with custom ops enabled, these layers consistently use the PyTorch native implementations, restoring correct behavior while keeping the existing execution model unchanged.
Test Plan
Tested with Qwen3-0.6B on CPU with `--enforce-eager`: